7 research outputs found

    Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction

    Background: This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference into Functions), as defined in ENTREZ-Gene, from a MEDLINE record. Inputs for this task include both a gene and a pointer to a MEDLINE reference. In the suggested approach we merge two independent sentence extraction strategies. The first strategy (LASt) uses argumentative features inspired by discourse-analysis models. The second (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence, thus providing a full ranking of all possible candidate GeneRiFs. A combination of the two approaches is proposed, which also aims at reducing the size of the selected segment by filtering out non-content-bearing rhetorical phrases. Results: Based on the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy is already competitive (52.78%). When used in a combined approach, the extraction task clearly shows improvement, achieving a Dice score of over 57% (+10%). Conclusions: Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.
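The Dice score reported above measures the overlap between an extracted sentence and the reference GeneRiF. The following is a minimal sketch of the classic Dice coefficient over word sets. The tokenizer and the function names `tokenize`, `dice`, and `best_sentence` are illustrative assumptions, and the exact Dice variant used in the evaluation is not specified in the abstract.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice(candidate, reference):
    """Classic Dice coefficient between the token sets of two strings."""
    a, b = tokenize(candidate), tokenize(reference)
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def best_sentence(sentences, reference):
    """Return the candidate sentence with the highest Dice overlap with the reference."""
    return max(sentences, key=lambda s: dice(s, reference))
```

A candidate that shares two of three tokens with a three-token reference scores 2 * 2 / (3 + 3), about 0.67.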

    Modèles vectoriels et bibliométriques pour la recherche d'information et la détection de nouveauté appliqués à la protéomique

    This thesis begins with a state-of-the-art review covering current issues in text mining and bibliometrics, applied to updating information in molecular biology databases, for which the MEDLINE digital library is an important source of information. We then show how argumentative information, extracted with a Bayesian classifier, can significantly improve the precision of a search engine in a related-article retrieval task. To evaluate this experiment, we use relevance judgments extracted automatically from bibliographic citation networks. Finally, we propose an extension of the classical information retrieval model applied to the literature, by modeling an information update task as performed by the annotators of the Swiss-Prot database. Textual information is enriched with topological information specific to bibliographic citation networks, yielding precision gains of about 10%.

    From Episodes of Care to Diagnosis Codes: Automatic Text Categorization for Medico-Economic Encoding

    We report on the design and evaluation of an original system to help assign ICD (International Classification of Diseases) codes to clinical narratives. The task is defined as a multi-class multi-document classification task. We combine a set of machine learning and data-poor methods to generate a single automatic text categorizer, which returns a ranked list of ICD codes. The combined ranking system currently obtains a precision of 75% at high ranks and a recall of about 63% for the top twenty returned codes, against a theoretical upper bound of about 79% (inter-coder agreement). The performance of the data-poor classifier is weak, whereas the use of temporal features such as anamnesis and prescription contents results in a statistically significant improvement.

    Argumentative Feedback: A Linguistically-motivated Term Expansion for Information Retrieval

    We report on the development of a new automatic feedback model to improve information retrieval in digital libraries. Our hypothesis is that some particular sentences, selected based on argumentative criteria, can be more useful than others to perform well-known feedback information retrieval tasks. The argumentative model we explore is based on four disjunct classes, which have been very regularly observed in scientific reports: PURPOSE, METHODS, RESULTS, CONCLUSION. To test this hypothesis, we use the Rocchio algorithm as a baseline. While Rocchio selects the features to be added to the original query based on statistical evidence, we propose to base our feature selection also on argumentative criteria. Thus, we restrict the expansion to features appearing only in sentences classified into one of our argumentative categories. Our results, obtained on the OHSUMED collection, show a significant improvement when expansion is based on PURPOSE (mean average precision = +23%) and CONCLUSION (mean average precision = +41%) contents rather than on other argumentative contents. These results suggest that argumentation is an important linguistic dimension that could benefit information retrieval.
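The restriction described above can be illustrated with a simplified, positive-feedback-only Rocchio expansion in which candidate terms are drawn only from sentences labeled with an allowed argumentative class. This is a sketch under stated assumptions, not the authors' implementation: the function name `rocchio_expand`, the raw term-frequency weighting, and the default values of `alpha`, `beta`, and `k` are all hypothetical, and sentence labels are assumed to come from an external classifier.

```python
from collections import Counter


def rocchio_expand(query_terms, feedback_docs, allowed_classes,
                   alpha=1.0, beta=0.75, k=5):
    """Expand a query from feedback documents, using only allowed sentences.

    feedback_docs: list of documents, each a list of (arg_class, tokens) pairs.
    allowed_classes: argumentative classes whose sentences may contribute terms.
    alpha, beta, k: hypothetical defaults, not taken from the source.
    """
    query = Counter(query_terms)
    centroid = Counter()
    for doc in feedback_docs:
        for arg_class, tokens in doc:
            if arg_class in allowed_classes:  # argumentative filter
                centroid.update(tokens)
    n = max(len(feedback_docs), 1)

    # Original query terms: alpha * query weight + beta * centroid weight.
    expanded = {t: alpha * w + beta * centroid[t] / n for t, w in query.items()}

    # New terms: keep the k highest-weighted terms not already in the query.
    new_terms = {t: beta * c / n for t, c in centroid.items() if t not in query}
    top = sorted(new_terms.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    expanded.update(top)
    return expanded
```

Calling `rocchio_expand(["gene"], docs, {"PURPOSE", "CONCLUSION"})` ignores every term that occurs only in METHODS or RESULTS sentences of the feedback documents.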

    Using Discourse Analysis to Improve Text Categorization in MEDLINE

    studied in medical informatics in the context of the MEDLINE database, both for helping search in MEDLINE and in order to provide an indicative "gist" of the content of an article. Automatic assignment of Medical Subject Headings (MeSH), which is formally an automatic text categorization task, has been proposed using different methods or combinations of methods, including machine learning (naïve Bayes, neural networks…), linguistically-motivated methods (syntactic parsing, semantic tagging), or information retrieval. METHODS: In the present study, we propose to evaluate the impact of the argumentative structures of scientific articles to improve the effectiveness of a categorizer that combines linguistically-motivated and information retrieval methods. Our argumentative categorizer, which uses representation levels inherited from the field of discourse analysis, is able to classify the sentences of an abstract into four classes: PURPOSE, METHODS, RESULTS and CONCLUSION. For the evaluation, the OHSUMED collection, a sample of MEDLINE, is used as a benchmark. For each abstract in the collection, the result of the argumentative classifier, i.e. the labeling of each sentence with an argumentative class, is used to modify the original ranking of the MeSH categorizer. RESULTS: The most effective combination (+2%, p<0.003) strongly overweights the METHODS section and moderately overweights the RESULTS and CONCLUSION sections. CONCLUSION: Although modest, the improvement brought by argumentative features for text categorization confirms that discourse analysis methods could benefit text mining in scientific digital libraries.
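One way to picture how sentence labels could modify a category ranking is a weighted sum of per-sentence category scores, with a larger weight for METHODS sentences. The sketch below is only an illustration: the abstract does not give the combination formula or the weight values, so `ARG_WEIGHTS`, `rerank`, and the input layout are hypothetical.

```python
from collections import defaultdict

# Hypothetical weights: METHODS strongly overweighted, RESULTS and CONCLUSION
# moderately overweighted. The actual values are not given in the source.
ARG_WEIGHTS = {"PURPOSE": 1.0, "METHODS": 2.0, "RESULTS": 1.25, "CONCLUSION": 1.25}


def rerank(sentence_scores, weights=ARG_WEIGHTS):
    """Rank categories by the weighted sum of their per-sentence scores.

    sentence_scores: list of (arg_class, {category: score}) pairs, one per sentence.
    Sentences with an unknown class get a neutral weight of 1.0.
    """
    totals = defaultdict(float)
    for arg_class, scores in sentence_scores:
        w = weights.get(arg_class, 1.0)
        for category, score in scores.items():
            totals[category] += w * score
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(totals, key=lambda c: (-totals[c], c))
```

With uniform weights a category supported mainly by a PURPOSE sentence can rank first, while the weighted version can promote a category supported by a METHODS sentence.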

    Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction-0

    Copyright information: Taken from "Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction", http://www.biomedcentral.com/1471-2105/9/S3/S9. BMC Bioinformatics 2008;9(Suppl 3):S9-S9. Published online 11 Apr 2008. PMCID: PMC2352866.